The National Basketball Association (NBA) is a men’s professional basketball league in North America composed of 30 teams. Through the efforts of David Stern, the fourth NBA commissioner, the NBA grew from a league little known outside the United States into a global sport. Beyond its business model and fame, the game itself has gone through great changes during the last 15 years. 1 2
Today we can analyze team and player performance from many angles using game data instead of simply watching video, which gives us more ways to learn about and enjoy basketball.
In this report, we use data to find out how the NBA has changed over the past 15 years. We are curious about how teams’ overall strategy has changed and how players have adapted to these changes.
Chao Yin is mainly responsible for collecting the team and player game stats, while Zeyu Yang is responsible for the players’ biographical information.
Our data is collected from Basketball-Reference, Stats NBA and Kaggle.
Basketball-Reference is a site providing both basic and sabermetric statistics and resources for basketball fans using official NBA data.
Stats NBA is the home of NBA Advanced Stats and provides official NBA Statistics and advanced analytics.
Kaggle is an online community that allows users to find and publish data sets.
Data on Basketball-Reference is stored in XML, so we can extract it directly using the XML and RCurl packages. However, some tables on the site are commented out and can only be downloaded manually in CSV form, so we chose Stats NBA for the other data. Extracting tables from Stats NBA is a bit harder than from Basketball-Reference because they are stored in JSON files. We use statsnbaR, which provides utility functions to download data from the Stats NBA API endpoints. We got teams from Basketball-Reference and players from Stats NBA.
Kaggle is the source of the players’ biographical data. The two sites above also provide this data, but it is harder to collect because it is not stored in tables.
The players datasets contain all regular-season information for every player in a given season.
General data covers basic player performance, including:
Profile information like Name, Team, Age, Game Played, Minutes Played, etc.
Shooting performance on 2-pointers, 3-pointers and free throws, like Field Goals Made, Field Goals Attempted, Field Goal Percentage, etc.
Basic stats per game like Rebounds, Assists, Steals, Blocks, Points, Turnovers, Personal Fouls, etc.
Advanced data measures and analyzes a player’s ability in one specific area:
Overall ratings like Offensive Rating, Defensive Rating, Net Rating, Player Impact Estimate, Usage Percentage, etc.
Passing/Assist ability like Assist Percentage, Assist to Turnover Ratio, Assist Ratio
Rebound ability like Offensive Rebound Percentage, Defensive Rebound Percentage, Rebound Percentage
Shooting ability like Effective Field Goal Percentage, True Shooting Percentage
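The two shooting measures in the last item can be written out explicitly. The sketch below uses the standard public definitions of eFG% and TS%; the function names `efg_pct` and `ts_pct` and the sample stat line are our own illustrations and are not part of the datasets.

```r
# Effective field goal percentage: a made three counts as 1.5 made field goals.
efg_pct <- function(fg, fg3, fga) {
  (fg + 0.5 * fg3) / fga
}

# True shooting percentage: points per shooting possession, where
# 0.44 * FTA approximates the possessions that end in free throws.
ts_pct <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical line: 8 FG (2 of them threes) on 16 FGA, 4 of 5 FT, 22 PTS
efg_pct(fg = 8, fg3 = 2, fga = 16)   # 0.5625
ts_pct(pts = 22, fga = 16, fta = 5)  # about 0.604

# A player whose only shot is a made three has eFG% = 1.5,
# which is why eFG% can exceed 1 in small samples.
efg_pct(fg = 1, fg3 = 1, fga = 1)    # 1.5
```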
The bio dataset contains players’ biographical data:
The year the player started playing in the NBA and the year he retired
Height and weight data
Birth date
College attended
The teams datasets contain similar information to the players datasets, but for each team in the league. In addition, teams provides ways to split the data in order to measure team performance from different angles:
Location helps measure teams’ gaming performance at home or on the road respectively
Wins-Losses tells how the team played when it won or lost the game
Month and Pre/Post All-Star show how team performance changes over time
Days Rest tests teams’ ability to handle tough schedules
NBA teams kept changing over these 15 years. Three teams changed their location and name, so the set of teams is not necessarily the same each year. Players can be traded or signed during the season, so some players have more records than others in these datasets.
Height in the bio dataset is saved as a character string, such as “6-8”, so we need to convert it to numeric.
Also, all data are saved as factors, which we need to convert to numeric or character.
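Both conversions can be sketched in base R. This is a minimal illustration rather than our exact cleaning script: `height_to_cm` is a hypothetical helper name, and we assume heights are feet-inches strings that should end up in centimetres, which matches the range of the cleaned height variable.

```r
# Convert a "feet-inches" string such as "6-8" to centimetres.
height_to_cm <- function(h) {
  parts <- strsplit(as.character(h), "-", fixed = TRUE)
  vapply(parts, function(p) {
    # Anything that is not exactly "feet-inches" becomes NA
    if (length(p) != 2 || anyNA(p)) return(NA_real_)
    (as.numeric(p[1]) * 12 + as.numeric(p[2])) * 2.54
  }, numeric(1))
}

height_to_cm(c("6-8", "5-5", NA))  # 203.2 165.1 NA

# Factors must go through character first: as.numeric() on a factor
# returns the internal level codes, not the printed values.
age <- factor(c("26", "19", "31"))
as.numeric(age)                # 2 1 3 (level codes)
as.numeric(as.character(age))  # 26 19 31
```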
After we got all the raw data in data/raw, we combined them into four datasets: Team_splits, Team_shooting, Player and Players_bio.
For the players’ data, we first remove empty rows and columns and convert the variables to numeric or character according to their content. Since more and more players can play more than one position today, we group the players into three kinds, Guards, Wings and Bigs, instead of using their original listed positions. Finally, we combine the players’ data of all 15 years to get Player.
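The grouping step could look like the sketch below. The mapping itself is an assumption made for illustration (PG and SG to Guards, SF to Wings, PF and C to Bigs, using the first listed position for hybrids such as "SF-SG"); the text above does not spell out the exact rule, and `group_position` is a hypothetical helper.

```r
# Map listed positions to three broad groups.
# ASSUMPTION: this lookup is illustrative, not the report's confirmed rule.
group_position <- function(pos) {
  primary <- sub("-.*$", "", pos)  # "SF-SG" -> "SF"
  lookup <- c(PG = "Guards", SG = "Guards",
              SF = "Wings",
              PF = "Bigs", C = "Bigs")
  factor(unname(lookup[primary]), levels = c("Guards", "Wings", "Bigs"))
}

group_position(c("PG", "SF-SG", "C", "PF"))
# Guards Wings Bigs Bigs
```

Positions that are not in the lookup come back as NA, which makes unexpected labels easy to spot before the yearly tables are combined.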
* Scroll down the table to see more details
print(dfSummary(Player,
headings = FALSE,
plain.ascii = FALSE,
valid.col = FALSE,
graph.magnif = 0.75,
style = "grid",
max.distinct.values = 5,
varnumbers = FALSE),
max.tbl.height = 500,method='render')
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Player [character] | 1. Kyle Korver 2. Devin Harris 3. Pau Gasol 4. Trevor Ariza 5. Vince Carter [ 1640 others ] | | 0 (0%) | |||||||||||||||||||||||||
| Pos [factor] | 1. Guards 2. Wings 3. Bigs | | 0 (0%) | |||||||||||||||||||||||||
| Age [numeric] | Mean (sd) : 26.6 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 6 (0.2) | 26 distinct values | 0 (0%) | |||||||||||||||||||||||||
| Tm [character] | 1. HOU 2. CLE 3. MEM 4. NYK 5. LAC [ 30 others ] | | 0 (0%) | |||||||||||||||||||||||||
| G [numeric] | Mean (sd) : 46.6 (26.6) min < med < max: 1 < 51 < 82 IQR (CV) : 49 (0.6) | 82 distinct values | 0 (0%) | |||||||||||||||||||||||||
| GS [numeric] | Mean (sd) : 22.6 (27.9) min < med < max: 0 < 7 < 82 IQR (CV) : 41.2 (1.2) | 83 distinct values | 0 (0%) | |||||||||||||||||||||||||
| MP [numeric] | Mean (sd) : 19.8 (10) min < med < max: 0 < 19.2 < 43.1 IQR (CV) : 16.4 (0.5) | 410 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FG [numeric] | Mean (sd) : 3 (2.1) min < med < max: 0 < 2.5 < 12.2 IQR (CV) : 2.9 (0.7) | 111 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FGA [numeric] | Mean (sd) : 6.7 (4.5) min < med < max: 0 < 5.6 < 27.2 IQR (CV) : 6.3 (0.7) | 228 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FG% [numeric] | Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) | 452 distinct values | 52 (0.6%) | |||||||||||||||||||||||||
| 3P [numeric] | Mean (sd) : 0.6 (0.7) min < med < max: 0 < 0.3 < 5.1 IQR (CV) : 1 (1.2) | 44 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 3PA [numeric] | Mean (sd) : 1.7 (1.8) min < med < max: 0 < 1.1 < 13.2 IQR (CV) : 2.7 (1.1) | 95 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 3P% [numeric] | Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) | 376 distinct values | 1335 (15.53%) | |||||||||||||||||||||||||
| 2P [numeric] | Mean (sd) : 2.4 (1.9) min < med < max: 0 < 1.9 < 10.3 IQR (CV) : 2.4 (0.8) | 99 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 2PA [numeric] | Mean (sd) : 5 (3.7) min < med < max: 0 < 3.9 < 22.2 IQR (CV) : 4.8 (0.7) | 198 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 2P% [numeric] | Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) | 444 distinct values | 94 (1.09%) | |||||||||||||||||||||||||
| eFG% [numeric] | Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) | 468 distinct values | 52 (0.6%) | |||||||||||||||||||||||||
| FT [numeric] | Mean (sd) : 1.4 (1.4) min < med < max: 0 < 1 < 10.3 IQR (CV) : 1.4 (1) | 92 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FTA [numeric] | Mean (sd) : 1.9 (1.7) min < med < max: 0 < 1.4 < 11.7 IQR (CV) : 1.9 (0.9) | 112 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FT% [numeric] | Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) | 576 distinct values | 438 (5.1%) | |||||||||||||||||||||||||
| ORB [numeric] | Mean (sd) : 0.9 (0.8) min < med < max: 0 < 0.7 < 6 IQR (CV) : 1 (0.9) | 54 distinct values | 0 (0%) | |||||||||||||||||||||||||
| DRB [numeric] | Mean (sd) : 2.6 (1.8) min < med < max: 0 < 2.2 < 12 IQR (CV) : 2.1 (0.7) | 111 distinct values | 0 (0%) | |||||||||||||||||||||||||
| TRB [numeric] | Mean (sd) : 3.5 (2.5) min < med < max: 0 < 2.9 < 18 IQR (CV) : 2.9 (0.7) | 148 distinct values | 0 (0%) | |||||||||||||||||||||||||
| AST [numeric] | Mean (sd) : 1.8 (1.8) min < med < max: 0 < 1.2 < 12.8 IQR (CV) : 1.8 (1) | 113 distinct values | 0 (0%) | |||||||||||||||||||||||||
| STL [numeric] | Mean (sd) : 0.6 (0.4) min < med < max: 0 < 0.5 < 2.9 IQR (CV) : 0.6 (0.7) | 30 distinct values | 0 (0%) | |||||||||||||||||||||||||
| BLK [numeric] | Mean (sd) : 0.4 (0.5) min < med < max: 0 < 0.2 < 6 IQR (CV) : 0.4 (1.2) | 39 distinct values | 0 (0%) | |||||||||||||||||||||||||
| TOV [numeric] | Mean (sd) : 1.1 (0.8) min < med < max: 0 < 1 < 5.7 IQR (CV) : 0.9 (0.7) | 51 distinct values | 0 (0%) | |||||||||||||||||||||||||
| PF [numeric] | Mean (sd) : 1.8 (0.8) min < med < max: 0 < 1.8 < 6 IQR (CV) : 1.2 (0.5) | 46 distinct values | 0 (0%) | |||||||||||||||||||||||||
| PTS [numeric] | Mean (sd) : 8 (5.9) min < med < max: 0 < 6.5 < 36.1 IQR (CV) : 7.9 (0.7) | 301 distinct values | 0 (0%) | |||||||||||||||||||||||||
| PER [numeric] | Mean (sd) : 12.7 (6.1) min < med < max: -54.4 < 12.6 < 133.8 IQR (CV) : 6.1 (0.5) | 412 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| TS% [numeric] | Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) | 481 distinct values | 25 (0.29%) | |||||||||||||||||||||||||
| 3PAr [numeric] | Mean (sd) : 0.2 (0.2) min < med < max: 0 < 0.2 < 1 IQR (CV) : 0.4 (0.9) | 784 distinct values | 26 (0.3%) | |||||||||||||||||||||||||
| FTr [numeric] | Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 6 IQR (CV) : 0.2 (0.8) | 778 distinct values | 26 (0.3%) | |||||||||||||||||||||||||
| ORB% [numeric] | Mean (sd) : 5.5 (4.8) min < med < max: 0 < 4.1 < 100 IQR (CV) : 6.3 (0.9) | 222 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| DRB% [numeric] | Mean (sd) : 14.5 (6.5) min < med < max: 0 < 13.5 < 100 IQR (CV) : 8.5 (0.4) | 354 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| TRB% [numeric] | Mean (sd) : 10 (5) min < med < max: 0 < 9 < 86.4 IQR (CV) : 7.1 (0.5) | 265 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| AST% [numeric] | Mean (sd) : 12.7 (9.2) min < med < max: 0 < 9.8 < 78.5 IQR (CV) : 10.9 (0.7) | 470 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| STL% [numeric] | Mean (sd) : 1.6 (0.9) min < med < max: 0 < 1.5 < 12.5 IQR (CV) : 0.8 (0.6) | 80 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| BLK% [numeric] | Mean (sd) : 1.6 (1.7) min < med < max: 0 < 1 < 26.3 IQR (CV) : 1.7 (1.1) | 109 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| TOV% [numeric] | Mean (sd) : 13.9 (6.2) min < med < max: 0 < 13.2 < 100 IQR (CV) : 5.8 (0.4) | 341 distinct values | 21 (0.24%) | |||||||||||||||||||||||||
| USG% [numeric] | Mean (sd) : 18.6 (5.3) min < med < max: 0 < 18.2 < 53.7 IQR (CV) : 6.8 (0.3) | 334 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| OWS [numeric] | Mean (sd) : 1.3 (2) min < med < max: -3.3 < 0.6 < 14.8 IQR (CV) : 2 (1.6) | 156 distinct values | 0 (0%) | |||||||||||||||||||||||||
| DWS [numeric] | Mean (sd) : 1.2 (1.2) min < med < max: -0.6 < 0.9 < 9.1 IQR (CV) : 1.5 (1) | 80 distinct values | 0 (0%) | |||||||||||||||||||||||||
| WS [numeric] | Mean (sd) : 2.5 (2.9) min < med < max: -2.1 < 1.6 < 20.3 IQR (CV) : 3.5 (1.2) | 184 distinct values | 0 (0%) | |||||||||||||||||||||||||
| WS/48 [numeric] | Mean (sd) : 0.1 (0.1) min < med < max: -1.3 < 0.1 < 2.7 IQR (CV) : 0.1 (1.4) | 557 distinct values | 3 (0.03%) | |||||||||||||||||||||||||
| OBPM [numeric] | Mean (sd) : -1.6 (3.6) min < med < max: -46.4 < -1.4 < 68.6 IQR (CV) : 3.4 (-2.2) | 283 distinct values | 0 (0%) | |||||||||||||||||||||||||
| DBPM [numeric] | Mean (sd) : -0.4 (2.1) min < med < max: -23.1 < -0.4 < 17.1 IQR (CV) : 2.4 (-4.8) | 185 distinct values | 0 (0%) | |||||||||||||||||||||||||
| BPM [numeric] | Mean (sd) : -2 (4.3) min < med < max: -59 < -1.7 < 54.4 IQR (CV) : 4.3 (-2.1) | 334 distinct values | 0 (0%) | |||||||||||||||||||||||||
| VORP [numeric] | Mean (sd) : 0.6 (1.3) min < med < max: -2.2 < 0 < 12.4 IQR (CV) : 1.1 (2.3) | 112 distinct values | 0 (0%) | |||||||||||||||||||||||||
| year [integer] | Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) | 16 distinct values | 0 (0%) |
For Players_bio, we join the players’ data with the biographical data and convert the variables to numeric or character according to their content.
* Scroll down the table to see more details
print(dfSummary(Players_bio,
headings = FALSE,
plain.ascii = FALSE,
valid.col = FALSE,
graph.magnif = 0.75,
style = "grid",
max.distinct.values = 5,
varnumbers = FALSE),
max.tbl.height = 500,method='render')
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rk [numeric] | Mean (sd) : 239.6 (137.8) min < med < max: 1 < 239 < 540 IQR (CV) : 238 (0.6) | 540 distinct values | 0 (0%) | |||||||||||||||||||||||||
| Player [character] | 1. Mike James 2. Mike Dunleavy 3. Chris Johnson 4. David Lee 5. Corey Brewer [ 1641 others ] | | 0 (0%) | |||||||||||||||||||||||||
| Pos [character] | 1. SG 2. PF 3. PG 4. C 5. SF [ 10 others ] | | 0 (0%) | |||||||||||||||||||||||||
| Age [numeric] | Mean (sd) : 26.6 (4.2) min < med < max: 18 < 26 < 44 IQR (CV) : 7 (0.2) | 26 distinct values | 0 (0%) | |||||||||||||||||||||||||
| Tm [character] | 1. TOT 2. HOU 3. CLE 4. NYK 5. MEM [ 31 others ] | | 0 (0%) | |||||||||||||||||||||||||
| G [numeric] | Mean (sd) : 46.6 (26.3) min < med < max: 1 < 51 < 85 IQR (CV) : 49 (0.6) | 85 distinct values | 0 (0%) | |||||||||||||||||||||||||
| GS [numeric] | Mean (sd) : 21.9 (27.4) min < med < max: 0 < 7 < 83 IQR (CV) : 40 (1.2) | 84 distinct values | 0 (0%) | |||||||||||||||||||||||||
| MP [numeric] | Mean (sd) : 1078 (877.4) min < med < max: 0 < 887 < 3424 IQR (CV) : 1483 (0.8) | 2828 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FG [numeric] | Mean (sd) : 166.1 (164.4) min < med < max: 0 < 114 < 978 IQR (CV) : 225 (1) | 727 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FGA [numeric] | Mean (sd) : 366.9 (352.8) min < med < max: 0 < 260 < 2173 IQR (CV) : 489 (1) | 1370 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FG% [numeric] | Mean (sd) : 0.4 (0.1) min < med < max: 0 < 0.4 < 1 IQR (CV) : 0.1 (0.2) | 458 distinct values | 53 (0.55%) | |||||||||||||||||||||||||
| 3P [numeric] | Mean (sd) : 33.1 (46.4) min < med < max: 0 < 10 < 402 IQR (CV) : 52 (1.4) | 249 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 3PA [numeric] | Mean (sd) : 93 (122.9) min < med < max: 0 < 34 < 1028 IQR (CV) : 146 (1.3) | 552 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 3P% [numeric] | Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.3 < 1 IQR (CV) : 0.2 (0.6) | 380 distinct values | 1490 (15.33%) | |||||||||||||||||||||||||
| 2P [numeric] | Mean (sd) : 133 (140.6) min < med < max: 0 < 85 < 798 IQR (CV) : 174 (1.1) | 644 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 2PA [numeric] | Mean (sd) : 273.9 (280.7) min < med < max: 0 < 182 < 1655 IQR (CV) : 350 (1) | 1140 distinct values | 0 (0%) | |||||||||||||||||||||||||
| 2P% [numeric] | Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1 IQR (CV) : 0.1 (0.2) | 446 distinct values | 98 (1.01%) | |||||||||||||||||||||||||
| eFG% [numeric] | Mean (sd) : 0.5 (0.1) min < med < max: 0 < 0.5 < 1.5 IQR (CV) : 0.1 (0.2) | 473 distinct values | 53 (0.55%) | |||||||||||||||||||||||||
| FT [numeric] | Mean (sd) : 80.1 (98.8) min < med < max: 0 < 44 < 756 IQR (CV) : 99 (1.2) | 515 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FTA [numeric] | Mean (sd) : 105.7 (125.1) min < med < max: 0 < 61 < 916 IQR (CV) : 129 (1.2) | 615 distinct values | 0 (0%) | |||||||||||||||||||||||||
| FT% [numeric] | Mean (sd) : 0.7 (0.2) min < med < max: 0 < 0.8 < 1 IQR (CV) : 0.2 (0.2) | 582 distinct values | 475 (4.89%) | |||||||||||||||||||||||||
| ORB [numeric] | Mean (sd) : 48.4 (57.1) min < med < max: 0 < 28 < 440 IQR (CV) : 56 (1.2) | 310 distinct values | 0 (0%) | |||||||||||||||||||||||||
| DRB [numeric] | Mean (sd) : 139.5 (137.5) min < med < max: 0 < 102 < 894 IQR (CV) : 174 (1) | 650 distinct values | 0 (0%) | |||||||||||||||||||||||||
| TRB [numeric] | Mean (sd) : 187.8 (188.3) min < med < max: 0 < 133 < 1247 IQR (CV) : 228 (1) | 837 distinct values | 0 (0%) | |||||||||||||||||||||||||
| AST [numeric] | Mean (sd) : 97.4 (123.3) min < med < max: 0 < 53 < 925 IQR (CV) : 117 (1.3) | 610 distinct values | 0 (0%) | |||||||||||||||||||||||||
| STL [numeric] | Mean (sd) : 33.6 (32.5) min < med < max: 0 < 24 < 217 IQR (CV) : 44 (1) | 179 distinct values | 0 (0%) | |||||||||||||||||||||||||
| BLK [numeric] | Mean (sd) : 21.2 (30.4) min < med < max: 0 < 10 < 307 IQR (CV) : 23 (1.4) | 208 distinct values | 0 (0%) | |||||||||||||||||||||||||
| TOV [numeric] | Mean (sd) : 61.4 (59.5) min < med < max: 0 < 44 < 464 IQR (CV) : 78 (1) | 304 distinct values | 0 (0%) | |||||||||||||||||||||||||
| PF [numeric] | Mean (sd) : 93.3 (71.2) min < med < max: 0 < 83 < 332 IQR (CV) : 117 (0.8) | 304 distinct values | 0 (0%) | |||||||||||||||||||||||||
| PTS [numeric] | Mean (sd) : 445.5 (448.9) min < med < max: 0 < 303 < 2832 IQR (CV) : 600 (1) | 1656 distinct values | 0 (0%) | |||||||||||||||||||||||||
| Year [numeric] | Mean (sd) : 2011.7 (4.7) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) | 16 distinct values | 0 (0%) | |||||||||||||||||||||||||
| year_start [integer] | Mean (sd) : 2006.5 (6.7) min < med < max: 1952 < 2007 < 2018 IQR (CV) : 10 (0) | 44 distinct values | 275 (2.83%) | |||||||||||||||||||||||||
| year_end [integer] | Mean (sd) : 2014.3 (5) min < med < max: 1958 < 2016 < 2018 IQR (CV) : 6 (0) | 31 distinct values | 275 (2.83%) | |||||||||||||||||||||||||
| position [character] | 1. G 2. F 3. C 4. F-C 5. G-F [ 2 others ] | | 275 (2.83%) | |||||||||||||||||||||||||
| height [numeric] | Mean (sd) : 200.6 (9.1) min < med < max: 165.1 < 200.7 < 228.6 IQR (CV) : 15.2 (0) | 22 distinct values | 275 (2.83%) | |||||||||||||||||||||||||
| weight [integer] | Mean (sd) : 219.8 (26.9) min < med < max: 135 < 220 < 360 IQR (CV) : 40 (0.1) | 120 distinct values | 275 (2.83%) | |||||||||||||||||||||||||
| birth_date [character] | 1. June 26, 1984 2. June 1, 1985 3. March 25, 1986 4. May 19, 1976 5. August 17, 1986 [ 1411 others ] | | 275 (2.83%) | |||||||||||||||||||||||||
| college [character] | 1. 2. University of Kentucky 3. Duke University 4. University of North Carol 5. University of California, [ 229 others ] | | 275 (2.83%) |
For the teams’ data, we split it into two datasets, Team_splits and Team_shooting.
Team_splits contains all the per-game stats for each of the 30 teams in every season. We chose the Location filter because every team plays 41 home games and 41 road games each year, so we can simply average the two splits to get seasonal averages. We changed the format, removed the ranking variables, combined the basic and advanced data, and put all 15 years of data into this one dataset.
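The home/road averaging can be sketched with base R’s `aggregate()`. The data frame `splits` and its values are made up for illustration; only the idea (one Home row and one Road row per team and season, averaged with equal weight) comes from the description above.

```r
# Hypothetical Location splits: one Home row and one Road row per team-season.
splits <- data.frame(
  team     = c("Team A", "Team A", "Team B", "Team B"),
  year     = 2019L,
  location = c("Home", "Road", "Home", "Road"),
  pts      = c(112, 108, 101, 99),
  pace     = c(100, 98, 96, 97)
)

# Home and road each cover 41 games, so an unweighted mean of the two rows
# equals the seasonal per-game average.
season_avg <- aggregate(cbind(pts, pace) ~ team + year, data = splits, FUN = mean)
season_avg
#     team year pts pace
# 1 Team A 2019 110 99.0
# 2 Team B 2019 100 96.5
```

The unweighted mean is only valid because the two splits cover the same number of games; splits such as Month or Days Rest would need game counts as weights.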
* Scroll down the table to see more details
print(dfSummary(Team_splits,
headings = FALSE,
plain.ascii = FALSE,
valid.col = FALSE,
graph.magnif = 0.75,
style = "grid",
max.distinct.values = 5,
varnumbers = FALSE),
max.tbl.height = 500,method='render')
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| team [character] | 1. Atlanta Hawks 2. Boston Celtics 3. Chicago Bulls 4. Cleveland Cavaliers 5. Dallas Mavericks [ 31 others ] | | 0 (0%) | |||||||||||||||||||||||||
| pctWins [numeric] | Mean (sd) : 0.5 (0.2) min < med < max: 0.1 < 0.5 < 0.9 IQR (CV) : 0.2 (0.3) | 115 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fgm [numeric] | Mean (sd) : 37.5 (2.1) min < med < max: 32.4 < 37.3 < 44 IQR (CV) : 2.7 (0.1) | 168 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fga [numeric] | Mean (sd) : 82.5 (3.6) min < med < max: 74.2 < 82.2 < 94 IQR (CV) : 5.1 (0) | 220 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctFG [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.5 IQR (CV) : 0 (0) | 125 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fg3m [numeric] | Mean (sd) : 7.4 (2.3) min < med < max: 2.8 < 7 < 16.1 IQR (CV) : 3 (0.3) | 164 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fg3a [numeric] | Mean (sd) : 20.7 (6.1) min < med < max: 8.2 < 19.5 < 45.3 IQR (CV) : 8.3 (0.3) | 294 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctFG3 [numeric] | Mean (sd) : 0.4 (0) min < med < max: 0.3 < 0.4 < 0.4 IQR (CV) : 0 (0.1) | 478 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctFT [numeric] | Mean (sd) : 0.8 (0) min < med < max: 0.7 < 0.8 < 0.8 IQR (CV) : 0 (0) | 206 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fg2m [numeric] | Mean (sd) : 30.1 (1.9) min < med < max: 23.1 < 30.2 < 35.2 IQR (CV) : 2.4 (0.1) | 151 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fg2a [numeric] | Mean (sd) : 61.8 (4.6) min < med < max: 41.9 < 62.1 < 74.3 IQR (CV) : 6.1 (0.1) | 253 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctFG2 [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) | 479 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ftm [numeric] | Mean (sd) : 18.2 (2) min < med < max: 12.2 < 18.1 < 24.1 IQR (CV) : 2.6 (0.1) | 153 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fta [numeric] | Mean (sd) : 24 (2.6) min < med < max: 16.6 < 23.9 < 31.6 IQR (CV) : 3.3 (0.1) | 196 distinct values | 0 (0%) | |||||||||||||||||||||||||
| oreb [numeric] | Mean (sd) : 11 (1.3) min < med < max: 7.6 < 10.9 < 14.6 IQR (CV) : 1.7 (0.1) | 113 distinct values | 0 (0%) | |||||||||||||||||||||||||
| dreb [numeric] | Mean (sd) : 31.5 (2.1) min < med < max: 26.9 < 31.2 < 40.5 IQR (CV) : 3 (0.1) | 159 distinct values | 0 (0%) | |||||||||||||||||||||||||
| treb [numeric] | Mean (sd) : 42.4 (2) min < med < max: 36.8 < 42.2 < 49.7 IQR (CV) : 2.7 (0) | 154 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ast [numeric] | Mean (sd) : 21.9 (2) min < med < max: 17.4 < 21.6 < 30.4 IQR (CV) : 2.6 (0.1) | 157 distinct values | 0 (0%) | |||||||||||||||||||||||||
| tov [numeric] | Mean (sd) : 14.4 (1.1) min < med < max: 11.2 < 14.4 < 17.7 IQR (CV) : 1.4 (0.1) | 106 distinct values | 0 (0%) | |||||||||||||||||||||||||
| stl [numeric] | Mean (sd) : 7.5 (0.9) min < med < max: 5.5 < 7.5 < 10 IQR (CV) : 1.1 (0.1) | 81 distinct values | 0 (0%) | |||||||||||||||||||||||||
| blk [numeric] | Mean (sd) : 4.9 (0.8) min < med < max: 2.4 < 4.8 < 8.2 IQR (CV) : 1 (0.2) | 78 distinct values | 0 (0%) | |||||||||||||||||||||||||
| blka [numeric] | Mean (sd) : 4.9 (0.7) min < med < max: 3 < 4.9 < 6.9 IQR (CV) : 0.9 (0.1) | 71 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pf [numeric] | Mean (sd) : 20.9 (1.7) min < med < max: 16.6 < 20.8 < 26.7 IQR (CV) : 2.4 (0.1) | 137 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pts [numeric] | Mean (sd) : 100.5 (5.9) min < med < max: 85.5 < 99.7 < 118.2 IQR (CV) : 7.6 (0.1) | 296 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pfd [numeric] | Mean (sd) : 19.5 (5.1) min < med < max: 0 < 20.4 < 25.6 IQR (CV) : 2.2 (0.3) | 119 distinct values | 32 (6.68%) | |||||||||||||||||||||||||
| pctAST [numeric] | Mean (sd) : 0.6 (0) min < med < max: 0.5 < 0.6 < 0.7 IQR (CV) : 0.1 (0.1) | 237 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctOREB [numeric] | Mean (sd) : 0.3 (0) min < med < max: 0.2 < 0.3 < 0.4 IQR (CV) : 0 (0.1) | 191 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctDREB [numeric] | Mean (sd) : 0.7 (0) min < med < max: 0.7 < 0.7 < 0.8 IQR (CV) : 0 (0) | 174 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctTREB [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.5 IQR (CV) : 0 (0) | 119 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctTOVTeam [numeric] | Mean (sd) : 0.2 (0) min < med < max: 0.1 < 0.2 < 0.2 IQR (CV) : 0 (0.1) | 112 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctEFG [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0) | 172 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctTS [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.5 < 0.5 < 0.6 IQR (CV) : 0 (0) | 151 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ortgE [numeric] | Mean (sd) : 104.1 (3.7) min < med < max: 92.3 < 103.9 < 113.9 IQR (CV) : 5.4 (0) | 227 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ortg [numeric] | Mean (sd) : 105.7 (3.7) min < med < max: 94.4 < 105.3 < 114.9 IQR (CV) : 5.1 (0) | 232 distinct values | 0 (0%) | |||||||||||||||||||||||||
| drtgE [numeric] | Mean (sd) : 104.1 (3.6) min < med < max: 91.6 < 104.2 < 115.1 IQR (CV) : 5.1 (0) | 229 distinct values | 0 (0%) | |||||||||||||||||||||||||
| drtg [numeric] | Mean (sd) : 105.7 (3.5) min < med < max: 93.1 < 105.8 < 116.8 IQR (CV) : 4.9 (0) | 223 distinct values | 0 (0%) | |||||||||||||||||||||||||
| netrtgE [numeric] | Mean (sd) : 0 (5) min < med < max: -15.5 < 0 < 12.1 IQR (CV) : 7 (672.1) | 274 distinct values | 0 (0%) | |||||||||||||||||||||||||
| netrtg [numeric] | Mean (sd) : 0 (4.7) min < med < max: -15.1 < 0.1 < 11.4 IQR (CV) : 6.8 (420.7) | 269 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ratioASTtoTO [numeric] | Mean (sd) : 1.5 (0.2) min < med < max: 1 < 1.5 < 2.1 IQR (CV) : 0.3 (0.1) | 151 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ratioAST [numeric] | Mean (sd) : 16.8 (1.2) min < med < max: 14.1 < 16.7 < 21.2 IQR (CV) : 1.5 (0.1) | 106 distinct values | 0 (0%) | |||||||||||||||||||||||||
| paceE [numeric] | Mean (sd) : 95.7 (3.5) min < med < max: 88.6 < 95.3 < 106.5 IQR (CV) : 4.9 (0) | 227 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pace [numeric] | Mean (sd) : 94.3 (3.4) min < med < max: 87.4 < 93.9 < 104.6 IQR (CV) : 4.8 (0) | 432 distinct values | 0 (0%) | |||||||||||||||||||||||||
| ratioPIE [numeric] | Mean (sd) : 0.5 (0) min < med < max: 0.4 < 0.5 < 0.6 IQR (CV) : 0 (0.1) | 211 distinct values | 0 (0%) | |||||||||||||||||||||||||
| year [integer] | Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 7.5 (0) | 16 distinct values | 0 (0%) |
Team_shooting contains each team’s shooting performance from different regions of the court. We cleaned it the same way as Team_splits.
* Scroll down the table to see more details
print(dfSummary(Team_shooting,
headings = FALSE,
plain.ascii = FALSE,
valid.col = FALSE,
graph.magnif = 0.75,
style = "grid",
max.distinct.values = 5,
varnumbers = FALSE),
max.tbl.height = 500,method='render')
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| team [character] | 1. Atlanta Hawks 2. Boston Celtics 3. Chicago Bulls 4. Cleveland Cavaliers 5. Dallas Mavericks [ 31 others ] | | 0 (0%) | |||||||||||||||||||||||||
| distance [character] | 1. 16-24 ft. 2. 24+ ft. 3. 8-16 ft. 4. Back Court Shot 5. Less Than 8 ft. | | 0 (0%) | |||||||||||||||||||||||||
| fgm [numeric] | Mean (sd) : 607.1 (528.8) min < med < max: 0 < 474 < 2259 IQR (CV) : 467.5 (0.9) | 939 distinct values | 0 (0%) | |||||||||||||||||||||||||
| fga [numeric] | Mean (sd) : 1335.9 (949.2) min < med < max: 3 < 1230 < 3891 IQR (CV) : 1225 (0.7) | 1309 distinct values | 0 (0%) | |||||||||||||||||||||||||
| pctFG [numeric] | Mean (sd) : 0.3 (0.2) min < med < max: 0 < 0.4 < 0.6 IQR (CV) : 0.1 (0.5) | 293 distinct values | 0 (0%) | |||||||||||||||||||||||||
| year [integer] | Mean (sd) : 2011.5 (4.6) min < med < max: 2004 < 2012 < 2019 IQR (CV) : 8 (0) | 16 distinct values | 0 (0%) |
* To understand the meaning of all variables, please visit StatsNBA.
As we can see in the tables above, Team_shooting has no missing values and Team_splits has missing values only in pfd. Since Player and Players_bio are similar to each other, we display only the missing values of Players_bio here.
visna(Players_bio)
Figure 1: Missing values
The first row in Figure 1 shows that the majority of the data has no missing values.
Rows that are missing year_start are also missing all the variables that follow it. This is because these columns come from another table, bio. Although the bio table itself has no missing values, it does not contain all the players that the Player data has.
We can also see that quite a few rows are missing 3P%, FT% and the other percentage variables. These variables describe a player’s shooting for the season, and a percentage is undefined when the player has no attempts. The missing values therefore mean that the player did not attempt that kind of shot that season.
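The link between zero attempts and a missing percentage can be reproduced on a toy example. The data frame `shots` is made up, and we assume here that the percentage is simply makes divided by attempts.

```r
# Hypothetical season lines for three players.
shots <- data.frame(
  Player = c("A", "B", "C"),
  `3P`   = c(2, 0, 0),
  `3PA`  = c(5, 0, 3),
  check.names = FALSE
)

# Player B has no attempts, so 0 / 0 gives NaN, which R counts as missing.
# Player C attempted threes but made none, so the percentage is a valid 0.
shots$`3P%` <- shots$`3P` / shots$`3PA`

colSums(is.na(shots))
# Player     3P    3PA    3P%
#      0      0      0      1
```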
data <- Players_bio%>%
filter(Year>=2004)%>%
select(Player,Age,Year)%>%
distinct()%>%
as.data.frame(stringsAsFactors = F)%>%
select(Age,Year)
# ggplot(data,aes(x=as.factor(Year),y=Age))+geom_boxplot()
ggplot(data, aes(x=Age, y=Year,group=Year)) +
stat_density_ridges(quantile_lines = TRUE, quantiles = 2, fill="grey80") +
geom_text(data=data %>% group_by(Year) %>%
summarise(Age=median(Age)),
aes(label=sprintf("%1.1f", Age)),
position=position_nudge(y=-0.1), colour="#17408B", size=3)+
geom_text(data=data %>% group_by(Year) %>%
summarise(Age=min(Age)),
aes(label=sprintf("%1.1f", Age)),
position=position_nudge(y=-0.1), colour="#17408B", size=3)+
geom_text(data=data %>% group_by(Year) %>%
summarise(Age=max(Age)),
aes(label=sprintf("%1.1f", Age)),
position=position_nudge(y=-0.1), colour="#17408B", size=3)+
xlab("")+
ylab("")+
ggtitle("Age Distribution")+
scale_y_continuous(breaks=seq(2004,2019))+
scale_x_continuous(breaks=seq(15,45,3))+
theme_minimal()+
theme(
plot.title = element_text(size=17.5,face="bold"),
axis.text.x = element_text(color = "#000000", size = 11),
axis.text.y = element_text(color = "#000000", size = 11))
The ridge plot presents the distribution of NBA players’ ages by year. The x-axis is age, the y-axis is the year, and the height of each ridge is the estimated density at that age.
The three numbers on each distribution are the minimum, median and maximum age (from left to right) for that season.
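The three labels per season are plain summary statistics and can be checked without the plot. A minimal sketch with made-up ages (the data frame `ages` is hypothetical):

```r
# Hypothetical player ages for two seasons.
ages <- data.frame(
  Year = c(2006, 2006, 2006, 2007, 2007, 2007),
  Age  = c(18, 26, 40, 19, 25, 38)
)

# One row per season with the three values printed on each ridge.
age_labels <- do.call(rbind, lapply(split(ages$Age, ages$Year), function(a) {
  c(min = min(a), median = median(a), max = max(a))
}))
age_labels
#      min median max
# 2006  18     26  40
# 2007  19     25  38
```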
As we can see from the plot, the age distribution has not changed greatly: most players are between 20 and 35. The median age decreases slightly, from 26 to 25.
We can also notice a jump in the minimum age between 2006 and 2007. The minimum age up to 2006 is 18, while after 2006 it is 19. This is because in 2006 the NBA raised the draft-eligible age from 18 to 19.
Another noticeable point is the maximum age, which rises and falls several times. Each increase is mainly driven by a single player. Take the increase from 2015 to 2019 as an example: the oldest player is Vince Carter, one of the oldest players in NBA history. Such players keep playing for several reasons: they are still productive, they have not suffered serious injuries, and so on.
data <- Players_bio%>%
filter(Year>=2004)%>%
select(Player,height,weight,Year)%>%
distinct()%>%
drop_na()%>%
as.data.frame(stringsAsFactors = F)%>%
select(height,weight,Year)%>%
dplyr::group_by(Year)%>%
dplyr::summarise(avg_h=mean(height),avg_w=mean(weight))%>%
dplyr::ungroup()%>%
mutate(hw_ratio=avg_h/avg_w)%>%
select(Year,hw_ratio)
ggplot()+
geom_line(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=2)+
geom_point(aes(x=Year,y=hw_ratio),data=data,color="#C9082A",size=4)+
geom_point(aes(x=Year,y=hw_ratio),data=data,color="white",size=2)+
scale_x_continuous(breaks=seq(2004,2019))+
scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
xlab('') +
ylab('') +
theme_minimal()+
ggtitle("Average Height/Weight Ratio Per Season")+
theme(plot.title = element_text(size=17.5,face="bold"),
axis.text.x = element_text(angle = 45, hjust = 1,color = "#000000", size = 11),
axis.text.y = element_text(color = "#000000", size = 11))
This plot shows the ratio of average height to average weight for each season, which reflects the players’ body shape. There is a clear increasing trend in this ratio after 2011.
Average height and average weight have not changed much over the past 15 years (as the following table shows), but the rising height/weight ratio means that weight is decreasing relative to height, which suggests that players are becoming leaner, more agile, and faster.
data <- Players_bio%>%
filter(Year>=2004)%>%
select(Player,height,weight,Year)%>%
distinct()%>%
drop_na()%>%
as.data.frame(stringsAsFactors = F)%>%
select(height,weight,Year)%>%
dplyr::group_by(Year)%>%
dplyr::summarise(avg_h=round(mean(height),1),avg_w=round(mean(weight),1))%>%
dplyr::ungroup()%>%
column_to_rownames(var="Year")%>%
t()%>%
as.data.frame(stringsAsFactors = FALSE)
pander(data)
| | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|
| avg_h | 201 | 201.1 | 200.8 | 200.6 | 200.7 | 201 | 200.8 | 201.1 |
| avg_w | 220.5 | 221.2 | 220.8 | 220.7 | 220.3 | 221.4 | 221.6 | 223.3 |

| | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|
| avg_h | 200.7 | 200.8 | 200.7 | 200.7 | 200.9 | 200.8 | 200.5 | 200.7 |
| avg_w | 222.4 | 222.5 | 221.5 | 221.2 | 221.2 | 219.6 | 217.3 | 217.9 |
* The first row (avg_h) is the average height in centimeters; the second row (avg_w) is the average weight in pounds.
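As a quick check of the trend, the height/weight ratio can be recomputed from three columns of the table (2004, 2011, and 2019). This is only a sketch using the rounded table values, so the results may differ slightly from the plotted ratios, which are computed from unrounded means.

```r
# Rounded averages copied from the table above (height in cm, weight in lb).
avg <- data.frame(
  Year  = c(2004, 2011, 2019),
  avg_h = c(201.0, 201.1, 200.7),
  avg_w = c(220.5, 223.3, 217.9)
)

# Height/weight ratio; note that it mixes centimeters and pounds.
avg$hw_ratio <- avg$avg_h / avg$avg_w

print(round(avg$hw_ratio, 3))  # 0.912 0.901 0.921
```

Height barely moves across the three seasons, so the change in the ratio comes almost entirely from the change in average weight.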
Team_splits %>% select(year, pts, pace) %>% group_by(year) %>% summarise(Pace = mean(pace), Points = mean(pts)) %>%
gather(key = 'type', value = 'value', -year) %>%
ggplot(aes(x = year, y = value)) +
geom_line(color = '#C9082A', size = 2) +
geom_point(color = '#C9082A', size = 4) +
geom_point(color = '#FFFFFF', size = 2) +
#scale_color_manual(values = c('#17408B', '#C9082A')) +
facet_grid(type ~ ., scales = 'free_y') +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
xlab('') +
ylab('') +
ggtitle('Pace and Points Per Game') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11),
axis.text.y = element_text(color = "#000000", size = 11),
strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
strip.background = element_rect(fill = '#17408B', colour = 'white'),
plot.title = element_text(size = 17.5, face = 'bold'),
legend.position = 'none')
Both Pace (the number of possessions a team uses per game) and PPG (Points Per Game) show an obvious trend over the last 15 years. The more possessions a team accumulates, the quicker the pace of the game.
From 2004 to 2013, pace and PPG fluctuate around 93 and 98 respectively, but from 2014 both keep growing. In 2019 in particular, pace rises to 101 from 98 the year before, and PPG increases by nearly 6 points over the previous season. The positive association between pace and PPG is easy to see: the more possessions a team has, the more chances it has to score, although its opponents get more chances as well.
* The formula for pace is: 48 x ((Tm Poss + Opp Poss) / (2 x (Tm MP / 5))). The numerator sums Team Possessions and Opponent’s Possessions. The denominator uses Team Minutes Played, the total number of minutes played by all players on the team. The factor of 48 scales the result to a regulation 48-minute game. (Source: Stats NBA)
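The pace formula can be written as a small R function. This is a sketch, not the code used to build `Team_splits`; the argument names and the example numbers are hypothetical. It includes the factor of 48 that Basketball-Reference uses to express pace per 48-minute game, which is needed to arrive at values on the scale of the 93 to 101 seen in the plot.

```r
# Pace: average possessions (team and opponent) per 48 minutes.
#   tm_poss  - team possessions
#   opp_poss - opponent possessions
#   tm_mp    - team minutes played, summed over all players
pace <- function(tm_poss, opp_poss, tm_mp) {
  48 * (tm_poss + opp_poss) / (2 * (tm_mp / 5))
}

# Hypothetical regulation game: 5 players x 48 minutes = 240 team minutes.
example_pace <- pace(tm_poss = 100, opp_poss = 98, tm_mp = 240)
print(example_pace)  # 99
```

Dividing team minutes by 5 converts player-minutes back into game minutes, so overtime games do not inflate the result.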
Team_splits %>% select(year, ortg, pctWins) %>% group_by(year) %>%
ggplot(aes(x = year, y=ortg, alpha = pctWins, color = pctWins)) +
geom_jitter(size = 2) +
scale_colour_gradient(low = "#8ec5ff", high = "#19293a",breaks
=c(0.2,0.4,0.6,0.8), labels=c("20%","40%","60%","80%"))+
geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2, show.legend = F) +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
guides(alpha=FALSE)+
ggtitle('Average Offensive Rating Per Game') +
xlab('') +
ylab('') +
labs(colour="% Win")+
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11),
axis.text.y = element_text(color = "#000000", size = 11),
plot.title = element_text(size = 17.5, face = 'bold'))
This plot shows the offrtg (offensive rating, a statistic that measures a team’s offensive performance) of each team in each of these 15 seasons. The color reflects each team’s win percentage: the darker the marker, the more the team wins.
The dashed line is a smoothed fit to the data. Teams with a higher offensive rating (points above the dashed line) tend to have a higher win percentage (darker points).
Offensive rating shows that teams’ offensive ability started growing in 2013 and reached an unprecedented level in 2018. This raises a question: apart from the faster pace, are there other reasons for such high offensive performance in recent years?
* offrtg = 100 x (Points / Poss). It measures a team’s points scored per 100 possessions. At the player level, this statistic is the team’s points scored per 100 possessions while the player is on the court. (Source: Stats NBA)
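A short sketch of the offensive rating formula, with hypothetical inputs. The second helper is an added illustration rather than something taken from the report: because offensive rating is points per 100 possessions and pace is possessions per game, their product divided by 100 gives roughly the points per game, which separates scoring gains due to efficiency from gains due to pace.

```r
# Offensive rating: points scored per 100 possessions.
off_rating <- function(points, poss) {
  100 * points / poss
}

# Rough points per game implied by an offensive rating and a pace
# (ignores overtime; for illustration only).
approx_ppg <- function(ortg, pace) {
  ortg * pace / 100
}

ortg_example <- off_rating(points = 110, poss = 100)      # 110
ppg_slow     <- approx_ppg(ortg = ortg_example, pace = 93)
ppg_fast     <- approx_ppg(ortg = ortg_example, pace = 101)

print(c(ortg_example, ppg_slow, ppg_fast))
```

With the rating held fixed, the two pace values give different scoring totals, so a rising offensive rating points to something beyond pace.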
p1 <- Team_splits %>% select(year, fg3a, fg2a) %>%
gather(key = 'type', value = 'attempt', -c(year)) %>%
group_by(year, type) %>% summarise(attempt = mean(attempt)) %>%
ggplot(aes(x = year, y = attempt, group = year)) +
#geom_boxplot(aes(color = type)) +
#geom_line() +
geom_bar(stat = 'identity', fill = '#C9082A') +
facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`fg2a` = '2 pointer', `fg3a` = '3 pointer'))) +
scale_color_manual(values = c('#17408B', '#C9082A')) +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
xlab('') +
ylab('') +
#ylim(0, 2500) +
ggtitle('Field Goals Attempt') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 15),
axis.text.y = element_text(color = "#000000", size = 15),
strip.text.y = element_text(size = 10, colour = '#FFFFFF', face = 'bold'),
strip.background = element_rect(fill = '#17408B', colour = 'white'),
plot.title = element_text(size = 17.5, face = 'bold'),
legend.position = 'none')
p2 <- Team_splits %>% select(year, pctFG3, pctFG2) %>%
gather(key = 'type', value = 'percentage', -c(year)) %>%
group_by(year, type) %>% summarise(percentage = mean(percentage)) %>%
ggplot(aes(x = year, y = percentage)) +
geom_line(color = '#C9082A', size = 2) +
geom_point(color = '#C9082A', size = 4) +
geom_point(color = '#FFFFFF', size = 2) +
facet_grid(type ~., scales = 'free_y', labeller = as_labeller(c(`pctFG2` = '2 pointer', `pctFG3` = '3 pointer'))) +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
xlab('') +
ylab('') +
ggtitle('Field Goals Percentage') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5,color = "#000000", size = 15),
axis.text.y = element_text(color = "#000000", size = 15),
strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
strip.background = element_rect(fill = '#17408B', colour = 'white'),
plot.title = element_text(size = 17.5, face = 'bold'))
grid.arrange(p1, p2, ncol = 2)
In basketball, a field goal is a basket scored on any shot or tap other than a free throw, worth two or three points depending on the distance of the attempt from the basket. An attempt is counted whether or not the shot is made.
This plot shows the league-average FGA (Field Goal Attempts) and FG% (Field Goal Percentage), per team per game, for both 2-pointers and 3-pointers. Note that each FG% is the share of made shots among all attempts of that type, so the 3-pointer FG% and the 2-pointer FG% do not need to add up to 100%.
In the left plot, NBA teams attempt more and more 3-pointers year by year without giving up many 2-pointer attempts. In 2019, FGA for 3-pointers is more than twice what it was 15 years earlier. Also in 2019, FGA for 3-pointers is above 30 and FGA for 2-pointers is below 60, which means that on average more than one in every three shots in an NBA game is a 3-pointer.
The right plot shows the FG% of 2-pointers and 3-pointers from 2004 to 2019. The FG% for 2-pointers keeps growing from 2012 and has stayed above 50% since 2017. The FG% for 3-pointers fluctuates between 35% and 36% in most years. Teams are making their 2-pointers more efficient by raising the FG%.
From these two plots, we can see that the strategy NBA teams use to score more is to attempt more 3-pointers while making their 2-pointers more efficient.
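The arithmetic behind these observations can be sketched in a few lines of R. The attempt counts below are hypothetical values chosen to match the description (more than 30 three-point and fewer than 60 two-point attempts per game), and the shooting percentages are the approximate levels quoted above. The points-per-attempt comparison is an added illustration, not a calculation from the report.

```r
# Hypothetical per-game attempts consistent with the 2019 description.
fg3a <- 32
fg2a <- 57

# Share of all field goal attempts that are 3-pointers.
share_3 <- fg3a / (fg3a + fg2a)

# Expected points per attempt = shot value x FG%,
# using the approximate FG% levels quoted in the text.
pts_per_3pa <- 3 * 0.355
pts_per_2pa <- 2 * 0.50

print(round(c(share_3 = share_3, pts_per_3pa = pts_per_3pa,
              pts_per_2pa = pts_per_2pa), 3))
```

Under these assumed numbers, a 3-point attempt is worth slightly more on average than a 2-point attempt, which is one way to read the shift in shot selection.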
Team_shooting$distance <- factor(Team_shooting$distance, levels = unique(Team_shooting$distance))
p1 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, fga, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
ggplot(aes(x = year, y = fga/82, group = year)) +
#geom_boxplot() +
geom_bar(stat = 'identity', fill = '#C9082A') +
facet_grid(distance ~ ., scales = 'free_y') +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
xlab('') +
ylab('') +
#ylim(0,1500) +
ggtitle('Field Goals Attempt by Distance') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 15),
axis.text.y = element_text(color = "#000000", size = 15),
legend.position = 'none',
strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
strip.background = element_rect(fill = '#17408B', colour = 'white'),
plot.title = element_text(size = 17.5, face = 'bold'))
p2 <- Team_shooting %>% filter(distance != 'Back Court Shot') %>% select(distance, pctFG, year) %>% group_by(year, distance) %>% summarise_all(mean) %>%
ggplot(aes(x = year, y = pctFG)) +
geom_line(color = '#C9082A', size = 2) +
geom_point(color = '#C9082A', size = 4) +
geom_point(color = '#FFFFFF', size = 2) +
facet_grid(distance ~ ., scales = 'free_y') +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
xlab('') +
ylab('') +
ggtitle('Field Goals Percentage by Distance') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 15),
axis.text.y = element_text(color = "#000000", size = 15),
legend.position = 'none',
strip.text.y = element_text(size = 15, colour = '#FFFFFF', face = 'bold'),
strip.background = element_rect(fill = '#17408B', colour = 'white'),
plot.title = element_text(size = 17.5, face = 'bold'))
grid.arrange(p1, p2, ncol = 2)
This plot shows FGA and FG% for shots from different regions of the court. The distance is how far the shooting spot is from the basket.
Shots from beyond the 3-point line (23 feet 9 inches from the basket at the top of the arc) are 3-pointers and the others are 2-pointers. The 24+ ft data are similar to the 3-pointer data in the plot above. The 2-pointers can be decomposed into three types: ‘near basket’ (less than 8 ft.), ‘mid-range’ (8-16 ft.), and ‘long-range’ (16-24 ft.).
The left plot shows that ‘near basket’ shots have the highest FGA among the 2-pointer types; they reach 30 in 2019, more than the other two types combined. Meanwhile, ‘long-range’ attempts keep going down and ‘mid-range’ attempts remain around 12. Because the difficulty of making a field goal rises with the distance from the basket, ‘long-range’ 2-pointers are less valuable than ‘near basket’ ones. In the right plot, the FG% of ‘near basket’ shots is far above the others and reaches 58% in 2019, while the FG% of ‘mid-range’ shots also keeps rising.
This may explain how NBA teams manage to keep attempting more 3-pointers while raising the FG% of their 2-pointers: they take fewer shots from ‘low efficiency’ regions and focus more on shots near the basket.
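The distance buckets described above can be expressed with base R's `cut()`. This is a sketch only: `Team_shooting` already arrives with a `distance` label, so the function name `shot_zone` and the raw distances below are hypothetical.

```r
# Assign a shot distance (in feet) to one of the four zones used in the plot.
# right = FALSE makes each interval closed on the left: [0, 8), [8, 16), ...
shot_zone <- function(distance_ft) {
  cut(distance_ft,
      breaks = c(0, 8, 16, 24, Inf),
      labels = c("near basket", "mid-range", "long-range", "24+ ft"),
      right  = FALSE)
}

zones <- shot_zone(c(2, 10, 20, 26))
print(table(zones))
```

Using 24 ft as the last break is an approximation of the 3-point line, which is shorter in the corners, so a small number of corner 3-pointers would fall into the ‘long-range’ bucket under this scheme.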
Team_splits %>% select(year, pctTS, pctWins) %>% group_by(year) %>%
ggplot(aes(x = year, y=pctTS, alpha = pctWins, color = pctWins)) +
geom_jitter(size = 2) +
scale_colour_gradient(low = "#8ec5ff", high = "#19293a",breaks
=c(0.2,0.4,0.6,0.8), labels=c("20%","40%","60%","80%"))+
geom_smooth(linetype = 'longdash', colour = '#C9082A', se = FALSE, size = 2, show.legend = F) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5), legend.position = 'none') +
scale_x_continuous(labels = unique(Team_splits$year), breaks = unique(Team_splits$year)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))+
ggtitle('Average True Shooting Percentage Per Game') +
guides(alpha=FALSE)+
xlab('') +
ylab('') +
labs(colour="% Win")+
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5,color = "#000000", size = 11),
axis.text.y = element_text(color = "#000000", size = 11),
plot.title = element_text(size = 17.5, face = 'bold'))
TS% (True Shooting Percentage, a measure of shooting efficiency) combines field goal percentage, free throw percentage, and three-point field goal percentage into one number instead of treating them individually, which gives a more accurate picture of shooting. As before, the darker the marker, the more the team wins.
The TS% curve has a similar shape to the offrtg curve, and teams now shoot much more efficiently than they did 15 years ago.
* TS% = Points / (2 x (Field Goals Attempted + 0.44 x Free Throws Attempted)). This shooting percentage factors in the value of three-point field goals and free throws in addition to conventional two-point field goals. (Source: Stats NBA)
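The TS% formula translates directly into R. The function and argument names and the box-score numbers below are hypothetical; the sketch only illustrates the formula quoted above.

```r
# True shooting percentage:
#   pts - points scored
#   fga - field goal attempts (2-pointers and 3-pointers combined)
#   fta - free throw attempts
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical team game: 110 points on 88 field goal attempts
# and 25 free throw attempts.
ts_example <- true_shooting(pts = 110, fga = 88, fta = 25)
print(round(ts_example, 3))  # 0.556
```

The 0.44 weight converts free throw attempts into an approximate number of shooting possessions, so all three ways of scoring end up on the same per-attempt scale.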
The following interactive plots were created in a Shiny app, available at https://cy2507.shinyapps.io/NBA_15years/. You may click on the link to play with the interactive components.
This part analyzes how players’ scoring methods and efficiency changed during the last 15 years in the NBA. Definitions of the terms we mention are shown on the right-hand side.
We include players who appeared in more than 30 games in a season (out of 82), played more than 15 minutes per game (out of 48), and averaged at least 0.1 3-pointer attempts.
You can also click the buttons at the top left of the plot to show only the positions you are interested in.
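The selection criteria above amount to a simple row filter. The sketch below uses a made-up data frame; the column names (`games`, `min_per_game`, `fg3a_per_game`) are assumptions and not the names used in the app, and the 3-pointer threshold is assumed to be per game.

```r
# Hypothetical player-season data (all values invented).
player_seasons <- data.frame(
  Player        = c("A", "B", "C", "D"),
  games         = c(82, 25, 60, 70),
  min_per_game  = c(34, 30, 12, 28),
  fg3a_per_game = c(6.1, 2.0, 1.5, 0.0),
  stringsAsFactors = FALSE
)

# Keep players with more than 30 games, more than 15 minutes per game,
# and at least 0.1 three-point attempts per game.
eligible <- subset(player_seasons,
                   games > 30 & min_per_game > 15 & fg3a_per_game >= 0.1)

print(eligible$Player)  # "A"
```

Player B fails the games threshold, C fails the minutes threshold, and D never attempts a 3-pointer, so only A remains.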
Scoring-Front page
This plot uses 3-pointer and 2-pointer FGA (Field Goal Attempts) as axes and total FGA as marker size to show the scoring methods of teams in different seasons.
From the plot, we can see a noticeable increasing trend in 3-pointer FGA starting from 2012.
The plot uses 3-pointer and 2-pointer FG% (Field Goal Percentage) as axes and total FGA as marker size to show the scoring efficiency of teams in different seasons.
The 3-pointer field goal percentage stays about the same while the 2-pointer field goal percentage is increasing. Part of the strategy NBA teams use to score more is to make their 2-pointers more efficient.
The plot uses 3-pointers and 3-pointer percentage as axes and total FGA as marker size to show the 3-pointer performance of teams in different seasons.
The 3-pointer field goal percentage is around 30% to 40%. The other part of the strategy is to attempt more 3-pointers.
The plot uses 2-pointers and 2-pointer percentage as axes and total FGA as marker size to show the 2-pointer performance of teams in different seasons.
The 2-pointer field goal attempts are decreasing while the 2-pointer field goal percentage is increasing.
The plot uses TS% (True Shooting Percentage) and FG (Field Goals) as axes and FGA (Field Goal Attempts) as marker size to show the shooting ability of teams in different seasons.
TS% has increased by around 10% over the past 15 years, which is a large improvement in shooting ability. Among all positions, bigs improve the most.
A work by Chao Yin & Zeyu Yang
cy2507@columbia.edu | zy2327@columbia.edu